Hucho record captures in Austrian rivers

Fishermen like to catch record-breaking fish and, if successful, will go to some effort to document their historic achievement in the local pub ;-) This dataset contains the sizes of such record catches of Hucho in Austrian streams and rivers of various size. The data were collected from various sources, including the occasional black-and-white photograph hanging in a pub.

Here is one such picture from the author of the study:

The Danube salmon is an endangered species with dwindling populations. It needs intact river corridors for migration, so hydropower facilities should have a fish ladder that allows fish to bypass the turbines. The size of such a fish ladder is both a cost issue AND an ecological issue: it needs to be large enough to accommodate the largest fish expected in a given system. Large rivers usually host large fish, so river size can serve as a proxy for the fish size to be expected.

hucho = read.table("data/HuchoRatschan2012.txt", header = TRUE)
  1. Look for a reasonable relationship of a river size measure with a fish size measure. Deliver a model that could guide construction of fish passes. Consider transformations to linearize relationships.
names(hucho)
## [1] "river"      "population" "length"     "mass"       "discharge" 
## [6] "width"
plot(length~width,data=hucho)

plot(mass~log(discharge),data=hucho) # I go for this one, but several others may also make sense
humo1<-lm(mass~log(discharge),data=hucho)
summary(humo1)
## 
## Call:
## lm(formula = mass ~ log(discharge), data = hucho)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0650 -3.7538  0.5911  2.5827  9.0719 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     14.8898     1.9967   7.457  2.6e-08 ***
## log(discharge)   2.4026     0.5433   4.422 0.000118 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.908 on 30 degrees of freedom
## Multiple R-squared:  0.3946, Adjusted R-squared:  0.3744 
## F-statistic: 19.56 on 1 and 30 DF,  p-value: 0.0001184
abline(humo1)

coef(humo1)
##    (Intercept) log(discharge) 
##       14.88982        2.40255

Hucho model1: fish mass = 14.9 + 2.4 * log(Q)
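The fitted equation can be applied directly. A minimal sketch using the coefficients printed above and a hypothetical discharge of 50 m^3/s (units of Q and mass are assumed, not stated in the output):

```r
# Sketch: apply the reported model-1 coefficients to a hypothetical river.
b0 <- 14.88982   # intercept, from coef(humo1) above
b1 <- 2.40255    # slope on log(Q), from coef(humo1) above
Q  <- 50         # hypothetical discharge (assumed m^3/s)
mass_pred <- b0 + b1 * log(Q)
round(mass_pred, 1)   # expected record fish mass, roughly 24.3
```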

  2. A potential confounding variable could be population size. When a population of Hucho is very small, it is less likely to contain large specimens. The dataset includes an estimate of population size based on expert (= fishermen) opinion. Explore differences in maximum Hucho body size captured in the various systems depending on population size. Test for differences and produce a publishable final graph showing average maximum fish size as a function of population size. Don't forget to check assumptions before running your statistical analysis, and justify your choice of method.
boxplot(mass~population,data=hucho)

# anova-problem
hist(hucho$mass[hucho$population=="limited"]) # normality (ND) will anyway be hard to judge with so few values per group...

qqnorm(hucho$mass[hucho$population=="limited"])

qqnorm(hucho$mass[hucho$population=="medium"])

qqnorm(hucho$mass[hucho$population=="large"])

hist(unlist(tapply(hucho$mass,INDEX=hucho$population,FUN=scale)))

qqnorm(unlist(tapply(hucho$mass,INDEX=hucho$population,FUN=scale)))

shapiro.test(unlist(tapply(hucho$mass,INDEX=hucho$population,FUN=scale))) # check ND
## 
##  Shapiro-Wilk normality test
## 
## data:  unlist(tapply(hucho$mass, INDEX = hucho$population, FUN = scale))
## W = 0.95183, p-value = 0.1624
bartlett.test(mass~population,data=hucho) # check var.hom.
## 
##  Bartlett test of homogeneity of variances
## 
## data:  mass by population
## Bartlett's K-squared = 3.0522, df = 2, p-value = 0.2174
humo2<-lm(mass~population,data=hucho)
summary(humo2)
## 
## Call:
## lm(formula = mass ~ population, data = hucho)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0692 -4.1808 -0.6769  3.5511 12.0308 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         24.785      1.638  15.130 2.69e-15 ***
## populationlimited   -6.643      2.915  -2.279   0.0302 *  
## populationmedium    -1.715      2.317  -0.740   0.4650    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.906 on 29 degrees of freedom
## Multiple R-squared:  0.1527, Adjusted R-squared:  0.09425 
## F-statistic: 2.613 on 2 and 29 DF,  p-value: 0.0905
anova(humo2)
## Analysis of Variance Table
## 
## Response: mass
##            Df  Sum Sq Mean Sq F value Pr(>F)  
## population  2  182.29  91.144  2.6129 0.0905 .
## Residuals  29 1011.61  34.883                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA: population size does not affect fish size significantly
pairwise.t.test(hucho$mass,hucho$population,p.adjust.method="holm",pool.sd=TRUE) # note: the argument is p.adjust.method, not method - "holm" is the default, as the output confirms
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  hucho$mass and hucho$population 
## 
##         large limited
## limited 0.091 -      
## medium  0.465 0.203  
## 
## P value adjustment method: holm
# pairwise no significant difference!

Population size does not significantly influence maximum fish size (F=2.6, df1=2, df2=29, P=0.09), but there is a tendency for the limited populations to have smaller fish on average.
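The task also asked for a publishable graph of average maximum fish size per population class. A minimal sketch of a means +/- SE barplot, shown here on simulated data with the same structure as hucho (three population classes; all values made up):

```r
# Sketch of a publishable means +/- SE plot (simulated data, NOT hucho;
# group means, SDs and the kg unit are assumptions for illustration)
set.seed(7)
dat <- data.frame(
  population = rep(c("large", "limited", "medium"), each = 10),
  mass = rnorm(30, mean = rep(c(25, 18, 23), each = 10), sd = 5)
)
m  <- tapply(dat$mass, dat$population, mean)
se <- tapply(dat$mass, dat$population, function(x) sd(x) / sqrt(length(x)))
bp <- barplot(m, ylim = c(0, max(m + se) * 1.2),
              ylab = "Maximum fish mass (kg)", xlab = "Population size")
arrows(bp, m - se, bp, m + se, angle = 90, code = 3, length = 0.05)  # error bars
```

The same tapply/arrows pattern works unchanged on the real hucho columns.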

  3. You will find that both river size and population size affect maximum Hucho size. How can you test both predictors at once? Which would be the best model for fish-ladder builders if they wanted to improve the conservation status of the Danube salmon in the future?
hucho$population<-factor(hucho$population)
levels(hucho$population) # check order of levels
## [1] "large"   "limited" "medium"
plot(mass~log(discharge),data=hucho,col=hucho$population)
abline(humo1)

boxplot(humo1$residuals~hucho$population) # this residual analysis strongly hints at an effect of population!

humo3<-lm(mass~log(discharge)*population,data=hucho)

summary(humo3)
## 
## Call:
## lm(formula = mass ~ log(discharge) * population, data = hucho)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8704 -3.2566 -0.0332  1.8738  7.7866 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       15.1644     3.5591   4.261 0.000236 ***
## log(discharge)                     2.7973     0.9694   2.886 0.007753 ** 
## populationlimited                  0.2723     5.7999   0.047 0.962911    
## populationmedium                  -0.2760     4.3105  -0.064 0.949445    
## log(discharge):populationlimited  -1.9670     1.6120  -1.220 0.233328    
## log(discharge):populationmedium   -0.2446     1.1679  -0.209 0.835716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.494 on 26 degrees of freedom
## Multiple R-squared:  0.5602, Adjusted R-squared:  0.4757 
## F-statistic: 6.624 on 5 and 26 DF,  p-value: 0.0004239
anova(humo3)
## Analysis of Variance Table
## 
## Response: mass
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## log(discharge)             1 471.13  471.13 23.3305 5.261e-05 ***
## population                 2 163.37   81.68  4.0450   0.02955 *  
## log(discharge):population  2  34.36   17.18  0.8508   0.43863    
## Residuals                 26 525.04   20.19                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
humo4<-lm(mass~log(discharge)+population,data=hucho)
anova(humo4)
## Analysis of Variance Table
## 
## Response: mass
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## log(discharge)  1 471.13  471.13 23.5818 4.11e-05 ***
## population      2 163.37   81.68  4.0886  0.02768 *  
## Residuals      28 559.40   19.98                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(humo4)
## 
## Call:
## lm(formula = mass ~ log(discharge) + population, data = hucho)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8325 -2.7893 -0.0949  1.9198  8.1081 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        16.6708     2.1084   7.907 1.30e-08 ***
## log(discharge)      2.3593     0.4959   4.758 5.38e-05 ***
## populationlimited  -6.2152     2.2079  -2.815  0.00883 ** 
## populationmedium   -1.1626     1.7570  -0.662  0.51357    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.47 on 28 degrees of freedom
## Multiple R-squared:  0.5315, Adjusted R-squared:  0.4812 
## F-statistic: 10.59 on 3 and 28 DF,  p-value: 7.988e-05
plot(mass~log(discharge),data=hucho,col=c("green","red","black")[hucho$population]) # index colours by factor level: large=green, limited=red, medium=black
abline(a=coef(humo4)[1],b=coef(humo4)[2],col="green")
abline(a=coef(humo4)[1]+coef(humo4)[3],b=coef(humo4)[2],col="red")
abline(a=coef(humo4)[1]+coef(humo4)[4],b=coef(humo4)[2],col="black")

The best model for estimating fish-ladder dimensions is the additive model mass ~ log(Q) + population, read at its intercept for large populations:

Hucho model4: fish mass = 16.67 + 2.36 * log(Q)

which has similar slope but higher intercept than the model that ignores population size:

Hucho model1: fish mass = 14.9 + 2.4 * log(Q)

With model1, fish size would be underestimated, and fish passes would not accommodate the largest fish that we could hope for in eventually healthy populations.
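The size of that underestimate can be made concrete by evaluating the two reported equations over a range of hypothetical discharges (coefficients taken from the summaries above; units assumed):

```r
# Compare the two reported fits across hypothetical discharges (m^3/s assumed)
Q  <- c(5, 20, 50, 100, 200)
m1 <- 14.88982 + 2.40255 * log(Q)   # model1: ignores population status
m4 <- 16.67080 + 2.35930 * log(Q)   # model4: intercept for large populations
round(cbind(Q, m1, m4, diff = m4 - m1), 2)   # model4 predicts larger fish throughout
```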

Suspended particulate matter in lakes

Load the LakeSPM dataset (Lindström, M., Håkanson, L., Abrahamsson, O., Johansson, H. (1999) An empirical model for prediction of lake water suspended particulate matter. Ecological Modelling 121, 185–198).

lakes <- read.table("data/LakeSPM.txt",header=TRUE)

The study comprises a set of lakes with SPM (suspended particulate matter) as the main response of interest. High SPM is a water-quality issue; it can be generated by excessive phytoplankton growth, by resuspension of sediment in shallow lakes, or by terrigenous input into smaller lakes. For water-quality prediction, a model to predict SPM from easily accessible lake data should be generated. The available predictors (check the Excel sheet in the data folder for the exact meaning of the abbreviations) include area, Dm, Dmax, pH, TP, T, Q, DR and Vd.

  1. Formulate a set of hypotheses testing effects of productivity and lake depth on spm. Choose variables wisely, then run a formal hypothesis test using MLR. A suggestion is: H1: Productive lakes support phytoplankton growth, thus higher spm. H2: Shallow lakes allow resuspension, thus higher spm. Use variables TP and Dm, consider log-transformation (also of spm) to improve linearity.
# choose variables and run MLR
plot(lakes[,c('spm','TP','Dm')])

plot(log(lakes[,c('spm','TP','Dm')]))

  2. These hypotheses are lame ;-) We have known for decades that nutrients influence productivity. For water-quality forecasting, a model to predict spm is needed. Start an explorative approach with the aim of designing a good model for spm. Explore the data for potential relationships with the response spm. Assess distributions and the chance of collinearity by graphical means (e.g. using pairs or plot(data)). Consider appropriate transformations to improve linearity of relationships.
plot(lakes[,-c(1:2)])

boxplot(scale(lakes[,-c(1:2)]))

# pH is already on a log scale, Vd seems symmetrically distributed; all others are skewed and are log-transformed
names(lakes)
##  [1] "lakename" "lakeno"   "spm"      "area"     "Dm"       "Dmax"    
##  [7] "pH"       "TP"       "T"        "Q"        "DR"       "Vd"
lakes[,-c(1,2,7,12)]<-log(lakes[,-c(1,2,7,12)])
plot(lakes[,-c(1:2)])

boxplot(scale(lakes[,-c(1:2)]))

# distributions have improved, some linear relationships obvious - with response as well as among predictors (collinearity)
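A quick self-contained illustration of why the log-transform helps (simulated data, not the lake dataset): the sample skewness of a strongly right-skewed variable drops to near zero after log():

```r
# Moment-based sample skewness, before and after log-transforming a
# right-skewed (log-normal) sample; simulated data for illustration only
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
set.seed(1)
x  <- exp(rnorm(500))   # log-normal -> strongly right-skewed
sk <- c(raw = skewness(x), logged = skewness(log(x)))
round(sk, 2)            # raw clearly positive, logged close to 0
```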
  3. With vif from the car package you can assess collinearity. Identify redundant variables using VIF and drop them from the dataset.
lakes<-lakes[,-c(1:2)]
mlr.full<-lm(spm~.,data=lakes) # makes a model with all predictors available
library(car)
## Loading required package: carData
vif(mlr.full) #computes VIF
##         area           Dm         Dmax           pH           TP            T 
## 1.259843e+06 9.310628e+05 3.947461e+02 2.337318e+00 3.883131e+00 3.576987e+02 
##            Q           DR           Vd 
## 5.275149e+02 1.034419e+06 4.269601e+01
# check what VIF means:
summary(lm(area~.-spm,data=lakes)) # regressing area on all other predictors gives an R2 of (almost) 1!
## 
## Call:
## lm(formula = area ~ . - spm, data = lakes)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0053403 -0.0008962  0.0003460  0.0008821  0.0040917 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.894e-04  5.712e-02  -0.010    0.992    
## Dm           2.005e+00  1.526e-02 131.403   <2e-16 ***
## Dmax        -2.862e-03  9.258e-03  -0.309    0.761    
## pH           1.243e-04  1.002e-03   0.124    0.903    
## TP           7.409e-05  1.685e-03   0.044    0.965    
## T           -2.094e-04  3.995e-03  -0.052    0.959    
## Q           -1.008e-04  3.802e-03  -0.027    0.979    
## DR           1.999e+00  7.636e-03 261.836   <2e-16 ***
## Vd           9.416e-04  8.395e-03   0.112    0.912    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002301 on 17 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.677e+06 on 8 and 17 DF,  p-value: < 2.2e-16
# some really high VIF point to collinearity, start removing predictors and repeat procedure.

# full model without area
mlr.full<-lm(spm~.-area,data=lakes)
vif(mlr.full)
##         Dm       Dmax         pH         TP          T          Q         DR 
## 915.771551 392.539697   2.335205   3.882689 357.640871 527.493110 256.436037 
##         Vd 
##  42.664441
# full model without area and Dm
mlr.full<-lm(spm~.-area-Dm,data=lakes)
vif(mlr.full)
##       Dmax         pH         TP          T          Q         DR         Vd 
## 224.429091   2.326471   3.707657 130.252655 191.099549  88.934283  27.887176
# -> still high VIFs for some predictors, drop Dmax

# full model without Dm and area and Dmax
mlr.full<-lm(spm~.-area-Dm-Dmax,data=lakes)
vif(mlr.full)
##       pH       TP        T        Q       DR       Vd 
## 2.289150 3.480125 2.393403 2.406980 2.435265 2.249989
# -> now all predictors have VIF<10 and make sense as input for MLR
# note that DR and Vd (which remain included) are lake morphometric measures computed from area, Dm and Dmax!

# delete the unnecessary predictors from the dataframe
which(names(lakes) %in% c("area","Dm","Dmax"))
## [1] 2 3 4
lakes<-lakes[,-c(2,3,4)]
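The VIF reported by car::vif is exactly 1/(1 - R^2) from regressing each predictor on all the others, which is what the area regression above demonstrates. A self-contained sketch on simulated data (not the lake dataset) with one built-in collinearity:

```r
# Manual VIF = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining predictors; simulated data for illustration
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1 -> collinear pair
x3 <- rnorm(n)                  # independent predictor
X  <- cbind(x1, x2, x3)
vif_manual <- function(j, X) {
  r2 <- summary(lm(X[, j] ~ ., data = as.data.frame(X[, -j])))$r.squared
  1 / (1 - r2)
}
vifs <- sapply(1:3, vif_manual, X = X)
round(vifs, 1)   # x1 and x2 get very large VIFs, x3 stays near 1
```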
  4. Identify the best simple linear regression model (one predictor) and a full model with all non-redundant predictors, as starting points for a forward and a backward model-building process.
mlr.full<-lm(spm~.,data=lakes) # makes a model with all predictors available
summary(mlr.full)
## 
## Call:
## lm(formula = spm ~ ., data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92577 -0.11178  0.02268  0.20035  0.76549 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -3.60757    1.29121  -2.794  0.01158 * 
## pH           0.25790    0.16681   1.546  0.13860   
## TP           1.02714    0.26827   3.829  0.00113 **
## T           -0.04065    0.05496  -0.740  0.46863   
## Q           -0.02767    0.04319  -0.641  0.52947   
## DR           0.36796    0.12515   2.940  0.00840 **
## Vd           0.31319    0.32422   0.966  0.34619   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3871 on 19 degrees of freedom
## Multiple R-squared:  0.8797, Adjusted R-squared:  0.8417 
## F-statistic: 23.15 on 6 and 19 DF,  p-value: 8.788e-08
# -> this is "maximum" model with lots of probably unimportant predictors, maximum "achievable" r2 is now 0.88

plot(lakes)

simple.mod1<-lm(spm~TP,data=lakes)
summary(simple.mod1)
## 
## Call:
## lm(formula = spm ~ TP, data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90449 -0.28309 -0.08777  0.17917  1.20813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.8724     0.5793  -6.685 6.48e-07 ***
## TP            1.5517     0.1891   8.206 2.00e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5089 on 24 degrees of freedom
## Multiple R-squared:  0.7373, Adjusted R-squared:  0.7263 
## F-statistic: 67.35 on 1 and 24 DF,  p-value: 2.004e-08
plot(spm~TP,data=lakes)
abline(simple.mod1,col="red")

# -> very good relationship, high r2
  5. Using the functions add1 and drop1, terms can be added to or removed from models. Follow a forward and a backward model-building process to a final result and compare the two final models.
add1(simple.mod1,scope=~pH+TP+T+Q+DR+Vd,test="F")
## Single term additions
## 
## Model:
## spm ~ TP
##        Df Sum of Sq    RSS     AIC F value   Pr(>F)   
## <none>              6.2163 -33.204                    
## pH      1   2.02398 4.1923 -41.446 11.1041 0.002896 **
## T       1   0.41143 5.8048 -32.985  1.6302 0.214421   
## Q       1   0.00925 6.2070 -31.243  0.0343 0.854777   
## DR      1   1.96782 4.2484 -41.100 10.6533 0.003414 **
## Vd      1   0.79150 5.4248 -34.745  3.3558 0.079950 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> pH and DR seem to be good candidates for adding to the model
# -> add pH (its inclusion alongside TP causes strongest change in model according to F-value and produces lowest AIC)
mlr1<-lm(spm~TP+pH,data=lakes)
summary(mlr1)
## 
## Call:
## lm(formula = spm ~ TP + pH, data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86284 -0.24187 -0.04459  0.18318  0.91888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.5089     0.9285  -7.010 3.83e-07 ***
## TP            1.4664     0.1607   9.127 4.16e-09 ***
## pH            0.4105     0.1232   3.332   0.0029 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4269 on 23 degrees of freedom
## Multiple R-squared:  0.8228, Adjusted R-squared:  0.8074 
## F-statistic:  53.4 on 2 and 23 DF,  p-value: 2.275e-09
# -> good model, try adding more terms
add1(mlr1,scope=~pH+TP+T+Q+DR+Vd,test="F")
## Single term additions
## 
## Model:
## spm ~ TP + pH
##        Df Sum of Sq    RSS     AIC F value  Pr(>F)  
## <none>              4.1923 -41.446                  
## T       1   0.00252 4.1898 -39.462  0.0132 0.90954  
## Q       1   0.02363 4.1686 -39.593  0.1247 0.72735  
## DR      1   0.92796 3.2643 -45.951  6.2540 0.02033 *
## Vd      1   0.03692 4.1554 -39.676  0.1955 0.66270  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> add DR
mlr2<-lm(spm~TP+pH+DR,data=lakes)
summary(mlr2)
## 
## Call:
## lm(formula = spm ~ TP + pH + DR, data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94258 -0.17068  0.02951  0.18948  0.85895 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.4940     1.1623  -3.866 0.000835 ***
## TP            1.1434     0.1942   5.889 6.32e-06 ***
## pH            0.3058     0.1188   2.575 0.017261 *  
## DR            0.2871     0.1148   2.501 0.020332 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3852 on 22 degrees of freedom
## Multiple R-squared:  0.862,  Adjusted R-squared:  0.8432 
## F-statistic: 45.82 on 3 and 22 DF,  p-value: 1.247e-09
# -> good model, all terms significant try adding more terms
add1(mlr2,scope=~pH+TP+T+Q+DR+Vd,test="F")
## Single term additions
## 
## Model:
## spm ~ TP + pH + DR
##        Df Sum of Sq    RSS     AIC F value Pr(>F)
## <none>              3.2643 -45.951               
## T       1  0.009151 3.2552 -44.024  0.0590 0.8104
## Q       1  0.225708 3.0386 -45.814  1.5599 0.2254
## Vd      1  0.291287 2.9730 -46.381  2.0575 0.1662
# -> no more terms to add according to F -> mlr2 is the final model (as in the original publication)
# -> note that according to AIC we could add Vd and keep building a more complicated model

# alternatively with dropping terms from the full model
summary(mlr.full)
## 
## Call:
## lm(formula = spm ~ ., data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92577 -0.11178  0.02268  0.20035  0.76549 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -3.60757    1.29121  -2.794  0.01158 * 
## pH           0.25790    0.16681   1.546  0.13860   
## TP           1.02714    0.26827   3.829  0.00113 **
## T           -0.04065    0.05496  -0.740  0.46863   
## Q           -0.02767    0.04319  -0.641  0.52947   
## DR           0.36796    0.12515   2.940  0.00840 **
## Vd           0.31319    0.32422   0.966  0.34619   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3871 on 19 degrees of freedom
## Multiple R-squared:  0.8797, Adjusted R-squared:  0.8417 
## F-statistic: 23.15 on 6 and 19 DF,  p-value: 8.788e-08
# -> unreasonable model with many insignificant terms
drop1(mlr.full,test="F")
## Single term deletions
## 
## Model:
## spm ~ pH + TP + T + Q + DR + Vd
##        Df Sum of Sq    RSS     AIC F value   Pr(>F)   
## <none>              2.8466 -43.511                    
## pH      1   0.35810 3.2047 -42.430  2.3901 0.138597   
## TP      1   2.19631 5.0429 -30.643 14.6594 0.001133 **
## T       1   0.08194 2.9286 -44.773  0.5469 0.468633   
## Q       1   0.06147 2.9081 -44.955  0.4103 0.529466   
## DR      1   1.29509 4.1417 -35.762  8.6441 0.008404 **
## Vd      1   0.13980 2.9864 -44.264  0.9331 0.346193   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> drop Q (lowest F, highest P, also makes lowest AIC)
mlr3<-lm(spm~pH+TP+T+DR+Vd,data=lakes)
summary(mlr3)
## 
## Call:
## lm(formula = spm ~ pH + TP + T + DR + Vd, data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88293 -0.20247  0.00856  0.23363  0.81997 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -3.80213    1.23633  -3.075  0.00597 **
## pH           0.21254    0.14880   1.428  0.16861   
## TP           0.98497    0.25620   3.844  0.00101 **
## T           -0.03584    0.05364  -0.668  0.51164   
## DR           0.35463    0.12158   2.917  0.00853 **
## Vd           0.42131    0.27270   1.545  0.13804   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3813 on 20 degrees of freedom
## Multiple R-squared:  0.8771, Adjusted R-squared:  0.8464 
## F-statistic: 28.54 on 5 and 20 DF,  p-value: 1.869e-08
drop1(mlr3,test="F")
## Single term deletions
## 
## Model:
## spm ~ pH + TP + T + DR + Vd
##        Df Sum of Sq    RSS     AIC F value   Pr(>F)   
## <none>              2.9081 -44.955                    
## pH      1   0.29667 3.2048 -44.430  2.0403 0.168611   
## TP      1   2.14909 5.0572 -32.569 14.7800 0.001012 **
## T       1   0.06492 2.9730 -46.381  0.4465 0.511638   
## DR      1   1.23717 4.1453 -37.739  8.5084 0.008526 **
## Vd      1   0.34706 3.2552 -44.024  2.3868 0.138038   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> drop T
mlr4<-lm(spm~pH+TP+DR+Vd,data=lakes)
summary(mlr4)
## 
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82209 -0.18123 -0.00184  0.21378  0.85912 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.8703     1.2158  -3.183  0.00447 ** 
## pH            0.1878     0.1422   1.321  0.20081    
## TP            1.0959     0.1925   5.693 1.19e-05 ***
## DR            0.3433     0.1188   2.890  0.00876 ** 
## Vd            0.3709     0.2586   1.434  0.16618    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3763 on 21 degrees of freedom
## Multiple R-squared:  0.8743, Adjusted R-squared:  0.8504 
## F-statistic: 36.53 on 4 and 21 DF,  p-value: 3.542e-09
drop1(mlr4,test="F")
## Single term deletions
## 
## Model:
## spm ~ pH + TP + DR + Vd
##        Df Sum of Sq    RSS     AIC F value    Pr(>F)    
## <none>              2.9730 -46.381                      
## pH      1    0.2469 3.2200 -46.307  1.7443  0.200810    
## TP      1    4.5877 7.5608 -24.113 32.4054 1.192e-05 ***
## DR      1    1.1823 4.1554 -39.676  8.3513  0.008764 ** 
## Vd      1    0.2913 3.2643 -45.951  2.0575  0.166183    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# according to F we should drop pH, according to AIC this is already the best model
# rule of thumb: AIC differences <2 are not reliable, thus drop pH
mlr5<-lm(spm~TP+DR+Vd,data=lakes)
summary(mlr5)
## 
## Call:
## lm(formula = spm ~ TP + DR + Vd, data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78084 -0.20451 -0.01827  0.19437  0.88301 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.5364     0.6881  -3.686 0.001293 ** 
## TP            1.0455     0.1919   5.449 1.79e-05 ***
## DR            0.4158     0.1071   3.881 0.000805 ***
## Vd            0.5685     0.2145   2.651 0.014600 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3826 on 22 degrees of freedom
## Multiple R-squared:  0.8639, Adjusted R-squared:  0.8453 
## F-statistic: 46.55 on 3 and 22 DF,  p-value: 1.074e-09
drop1(mlr5,test="F")
## Single term deletions
## 
## Model:
## spm ~ TP + DR + Vd
##        Df Sum of Sq    RSS     AIC F value    Pr(>F)    
## <none>              3.2200 -46.307                      
## TP      1    4.3462 7.5662 -26.095 29.6947 1.788e-05 ***
## DR      1    2.2048 5.4248 -34.745 15.0638 0.0008054 ***
## Vd      1    1.0285 4.2484 -41.100  7.0268 0.0146000 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> no more terms to drop
# -> final model is different between forward and backward variable selection, even though 2 of 3 predictors are identical!
  6. Use step to identify a good model based on the AIC. How does this compare to the models created above? Check model assumptions and decide on a final model.
step(mlr.full,direction="backward")
## Start:  AIC=-43.51
## spm ~ pH + TP + T + Q + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## - Q     1   0.06147 2.9081 -44.955
## - T     1   0.08194 2.9286 -44.773
## - Vd    1   0.13980 2.9864 -44.264
## <none>              2.8466 -43.511
## - pH    1   0.35810 3.2047 -42.430
## - DR    1   1.29509 4.1417 -35.762
## - TP    1   2.19631 5.0429 -30.643
## 
## Step:  AIC=-44.96
## spm ~ pH + TP + T + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## - T     1   0.06492 2.9730 -46.381
## <none>              2.9081 -44.955
## - pH    1   0.29667 3.2048 -44.430
## - Vd    1   0.34706 3.2552 -44.024
## - DR    1   1.23717 4.1453 -37.739
## - TP    1   2.14909 5.0572 -32.569
## 
## Step:  AIC=-46.38
## spm ~ pH + TP + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## <none>              2.9730 -46.381
## - pH    1    0.2469 3.2200 -46.307
## - Vd    1    0.2913 3.2643 -45.951
## - DR    1    1.1823 4.1554 -39.676
## - TP    1    4.5877 7.5608 -24.113
## 
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = lakes)
## 
## Coefficients:
## (Intercept)           pH           TP           DR           Vd  
##     -3.8703       0.1878       1.0959       0.3433       0.3709
step(simple.mod1,scope=~pH+TP+T+Q+DR+Vd,direction="forward")
## Start:  AIC=-33.2
## spm ~ TP
## 
##        Df Sum of Sq    RSS     AIC
## + pH    1   2.02398 4.1923 -41.446
## + DR    1   1.96782 4.2484 -41.100
## + Vd    1   0.79150 5.4248 -34.745
## <none>              6.2163 -33.204
## + T     1   0.41143 5.8048 -32.985
## + Q     1   0.00925 6.2070 -31.243
## 
## Step:  AIC=-41.45
## spm ~ TP + pH
## 
##        Df Sum of Sq    RSS     AIC
## + DR    1   0.92796 3.2643 -45.951
## <none>              4.1923 -41.446
## + Vd    1   0.03692 4.1554 -39.676
## + Q     1   0.02363 4.1686 -39.593
## + T     1   0.00252 4.1898 -39.462
## 
## Step:  AIC=-45.95
## spm ~ TP + pH + DR
## 
##        Df Sum of Sq    RSS     AIC
## + Vd    1  0.291287 2.9730 -46.381
## <none>              3.2643 -45.951
## + Q     1  0.225708 3.0386 -45.814
## + T     1  0.009151 3.2552 -44.024
## 
## Step:  AIC=-46.38
## spm ~ TP + pH + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## <none>              2.9730 -46.381
## + T     1  0.064924 2.9081 -44.955
## + Q     1  0.044461 2.9286 -44.773
## 
## Call:
## lm(formula = spm ~ TP + pH + DR + Vd, data = lakes)
## 
## Coefficients:
## (Intercept)           TP           pH           DR           Vd  
##     -3.8703       1.0959       0.1878       0.3433       0.3709
# same solutions! compare with selection by F!

# combined forward and backward selection based on AIC
step(mlr.full,direction="both")
## Start:  AIC=-43.51
## spm ~ pH + TP + T + Q + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## - Q     1   0.06147 2.9081 -44.955
## - T     1   0.08194 2.9286 -44.773
## - Vd    1   0.13980 2.9864 -44.264
## <none>              2.8466 -43.511
## - pH    1   0.35810 3.2047 -42.430
## - DR    1   1.29509 4.1417 -35.762
## - TP    1   2.19631 5.0429 -30.643
## 
## Step:  AIC=-44.96
## spm ~ pH + TP + T + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## - T     1   0.06492 2.9730 -46.381
## <none>              2.9081 -44.955
## - pH    1   0.29667 3.2048 -44.430
## - Vd    1   0.34706 3.2552 -44.024
## + Q     1   0.06147 2.8466 -43.511
## - DR    1   1.23717 4.1453 -37.739
## - TP    1   2.14909 5.0572 -32.569
## 
## Step:  AIC=-46.38
## spm ~ pH + TP + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## <none>              2.9730 -46.381
## - pH    1    0.2469 3.2200 -46.307
## - Vd    1    0.2913 3.2643 -45.951
## + T     1    0.0649 2.9081 -44.955
## + Q     1    0.0445 2.9286 -44.773
## - DR    1    1.1823 4.1554 -39.676
## - TP    1    4.5877 7.5608 -24.113
## 
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = lakes)
## 
## Coefficients:
## (Intercept)           pH           TP           DR           Vd  
##     -3.8703       0.1878       1.0959       0.3433       0.3709
mlr8<-lm(spm~pH+TP+DR+Vd,data=lakes)

# assess the final model candidates: mlr2, mlr5 and mlr8
summary(mlr2)
## 
## Call:
## lm(formula = spm ~ TP + pH + DR, data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94258 -0.17068  0.02951  0.18948  0.85895 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.4940     1.1623  -3.866 0.000835 ***
## TP            1.1434     0.1942   5.889 6.32e-06 ***
## pH            0.3058     0.1188   2.575 0.017261 *  
## DR            0.2871     0.1148   2.501 0.020332 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3852 on 22 degrees of freedom
## Multiple R-squared:  0.862,  Adjusted R-squared:  0.8432 
## F-statistic: 45.82 on 3 and 22 DF,  p-value: 1.247e-09
summary(mlr5)
## 
## Call:
## lm(formula = spm ~ TP + DR + Vd, data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78084 -0.20451 -0.01827  0.19437  0.88301 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.5364     0.6881  -3.686 0.001293 ** 
## TP            1.0455     0.1919   5.449 1.79e-05 ***
## DR            0.4158     0.1071   3.881 0.000805 ***
## Vd            0.5685     0.2145   2.651 0.014600 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3826 on 22 degrees of freedom
## Multiple R-squared:  0.8639, Adjusted R-squared:  0.8453 
## F-statistic: 46.55 on 3 and 22 DF,  p-value: 1.074e-09
summary(mlr8)
## 
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = lakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82209 -0.18123 -0.00184  0.21378  0.85912 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.8703     1.2158  -3.183  0.00447 ** 
## pH            0.1878     0.1422   1.321  0.20081    
## TP            1.0959     0.1925   5.693 1.19e-05 ***
## DR            0.3433     0.1188   2.890  0.00876 ** 
## Vd            0.3709     0.2586   1.434  0.16618    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3763 on 21 degrees of freedom
## Multiple R-squared:  0.8743, Adjusted R-squared:  0.8504 
## F-statistic: 36.53 on 4 and 21 DF,  p-value: 3.542e-09
extractAIC(mlr2)
## [1]   4.00000 -45.95116
extractAIC(mlr5)
## [1]   4.00000 -46.30677
extractAIC(mlr8)
## [1]   5.00000 -46.38135
# -> R-squared and AIC are very similar; not enough reason to prefer the more complex model (4 vs. 3 predictors!)
# AIC differences < 2 between models indicate practically "equivalent" models
# -> choose mlr2 or mlr5
  1. The model is intended to be used for prediction. Test its prediction capacity using a leave-1-out approach: predict the response for each observation with a model built without that observation. Pick your preferred model from above. Use a loop to leave out one case at a time, then predict for the left-out case. Finally, compare observed vs. (unbiased) predicted values on a 1:1 graph.
# will use mlr2
ypred_cv <- rep(NA, nrow(lakes))
for(i in 1:nrow(lakes)) {
  # refit the model without observation i, then predict for the left-out observation
  mod <- lm(spm ~ TP + pH + DR, data = lakes[-i, ])
  ypred_cv[i] <- predict(mod, newdata = lakes[i, ])
}
plot(lakes$spm,ypred_cv)
abline(a=0,b=1,col="red")

cor.test(lakes$spm,ypred_cv)
## 
##  Pearson's product-moment correlation
## 
## data:  lakes$spm and ypred_cv
## t = 10.129, df = 24, p-value = 3.822e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7874937 0.9546839
## sample estimates:
##       cor 
## 0.9002392

Conclusion: mlr2 makes reasonably good predictions even for left-out cases.
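Beyond the correlation test, the leave-1-out predictions can be condensed into a cross-validated prediction error and skill. A sketch, reusing `ypred_cv` from the loop above:

```r
## cross-validated RMSE: typical size of a prediction error for a new case
rmse_cv <- sqrt(mean((lakes$spm - ypred_cv)^2))
## cross-validated R-squared (based on the PRESS statistic)
press <- sum((lakes$spm - ypred_cv)^2)
r2_cv <- 1 - press / sum((lakes$spm - mean(lakes$spm))^2)
rmse_cv; r2_cv
```

The cross-validated R-squared is typically somewhat lower than the in-sample R-squared (0.86 for mlr2), since each prediction is made for a case the model never saw.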

Mortality/Abundance of sugar maple

These data are from a study of the response of trees in North America to climate change (Talluto et al. (2017) Extinction debt and colonization credit delay range shifts of eastern North American trees. Nat Ecol Evol 1, 0182. https://doi.org/10.1038/s41559-017-0182).

Load the somewhat deconstructed dataset, containing climate data, data on tree abundance and mortality, and a species table:

clim = read.csv("data/tree_clim.csv")
trees = read.csv("data/trees.csv")
species = read.csv("data/tree_species.csv")

We have two potential hypotheses to explore:

1: Mortality of sugar maple (Acer saccharum) is negatively related to temperature and/or climatic variability.

2: Abundance of sugar maple has a hump-shaped (quadratic) relationship to climate (i.e., there is a climatic optimum)

Choose one of the two to work on. 1 implies a binomial process (k trees died out of n trees present), and will have cbind(died, n) as the response variable. 2 implies a poisson process and will have n as the response. Formulate a more specific alternative hypothesis.

Reshaping

The data will need to be reshaped a bit before use.

  1. Explore the climate data. Use reshape2::dcast to produce a “wide” format dataset. You have 6 variables, measured in plots across multiple years.

Measures of climate location

  • annual_mean_temp
  • mean_temp_wettest_quarter
  • tot_annual_pp (precipitation)
  • pp_warmest_quarter

Measures of climate variability

  • mean_diurnal_range
  • pp_seasonality

Multicollinearity is usually a problem with climate variables, because many variables measure very similar things. However, the data set is very large, and so variance inflation is somewhat less of a problem. You should still choose a smaller number of variables to test (max 4); justify this choice based on a hypothesis. You can use cor(clim[, -c(1,2)]) to make a correlation matrix; if two variables have a correlation > 0.6, you should only use one of them. You may (optionally) also check VIFs, but you should do so after fitting the model.
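For the optional VIF check, one possibility (an assumption here; not part of the original exercise code) is the `vif` function from the `car` package, applied after fitting a model with the chosen predictors:

```r
library(car)
## `mod` stands for any model fitted below with the chosen climate predictors
vif(mod)  # rule of thumb: values above ~5 (or 10) indicate problematic collinearity
```

For a GLM with only main effects this returns one value per predictor; with interactions or quadratic terms, `vif` reports generalized VIFs instead.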

library(reshape2)
clim_wide = dcast(clim, plot_id+year_measured ~ variable)
cor(clim_wide[,-(1:2)])
##                           annual_mean_temp mean_diurnal_range
## annual_mean_temp                 1.0000000         0.56455636
## mean_diurnal_range               0.5645564         1.00000000
## mean_temp_wettest_quarter        0.6576496         0.44545131
## pp_seasonality                   0.3349464         0.28917816
## pp_warmest_quarter               0.2391688         0.08772849
## tot_annual_pp                    0.5952070         0.25246700
##                           mean_temp_wettest_quarter pp_seasonality
## annual_mean_temp                         0.65764958     0.33494644
## mean_diurnal_range                       0.44545131     0.28917816
## mean_temp_wettest_quarter                1.00000000     0.66255407
## pp_seasonality                           0.66255407     1.00000000
## pp_warmest_quarter                       0.40558332     0.04250791
## tot_annual_pp                            0.04818496    -0.31387628
##                           pp_warmest_quarter tot_annual_pp
## annual_mean_temp                  0.23916885    0.59520697
## mean_diurnal_range                0.08772849    0.25246700
## mean_temp_wettest_quarter         0.40558332    0.04818496
## pp_seasonality                    0.04250791   -0.31387628
## pp_warmest_quarter                1.00000000    0.48682271
## tot_annual_pp                     0.48682271    1.00000000

I will select 4 variables, 1 each of temperature and precipitation, for both location and variability. I start with variability: since there are only 2 such variables, I select both (mean_diurnal_range and pp_seasonality). For temperature location, mean_temp_wettest_quarter is strongly correlated with pp_seasonality, so I go with annual_mean_temp. For precipitation, pp_warmest_quarter is the most distinct variable remaining, and has the added benefit of a nice interpretation with regard to drought stress (low pp_warmest_quarter = high drought stress).
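The screening for correlations > 0.6 can also be done programmatically rather than by eye. A sketch, using the correlation matrix computed above:

```r
cm = cor(clim_wide[, -(1:2)])
## list all variable pairs whose absolute correlation exceeds 0.6
high = which(abs(cm) > 0.6 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           r = cm[high])
```

`upper.tri` restricts the search to one triangle of the matrix, so each pair is reported once and the diagonal (r = 1) is excluded.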

## note that it is less error-prone if you refer to variables by name.
vars = c("plot_id", "year_measured", "annual_mean_temp", "mean_diurnal_range", "pp_seasonality", "pp_warmest_quarter")
my_clim_vars = clim_wide[, vars]
  1. Merge the climate data with the tree data (as shown in the lecture), and filter the data to only include Acer saccharum
trees = merge(trees, my_clim_vars, by.x = c("plot", "year"), 
              by.y= c("plot_id", "year_measured"))
trees = subset(trees, species_code == "28731-ACE-SAC") ## this is the species code for sugar maple
  1. Fit GLMs (using the glm function with the family = poisson or family = 'binomial' option, depending on your response). Use AIC(mod) to compare different hypotheses about which variables to include. Choose a final ‘best’ model, and report the results using appropriate tables, figures, etc. If you use a poisson model, be sure to check for overdispersion in each of your models (it is not necessary to correct if it is too high in your final model, but do report it).

Hypothesis 1

Binomial models need to know the count of events (here, deaths, because we are modelling mortality), but they also need the number of "trials"; in this case, we cannot observe more deaths than there were trees, so the number of trials is n. We combine the two into the response variable: cbind(died, n)
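An equivalent way to specify a binomial response in R is to model the observed proportion and pass the number of trials via `weights`; this yields the same coefficients as the `cbind` form. A sketch:

```r
## identical model, specified as proportion + weights instead of cbind(died, n)
mod_alt = glm(died / n ~ annual_mean_temp + mean_diurnal_range +
                pp_seasonality + pp_warmest_quarter,
              data = trees, weights = n, family = binomial)
```

The `cbind` form is used in the code below; the proportion form is sometimes convenient when the data already come as rates.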

## biggest model, including all variables
mod_h1_1 = glm(cbind(died, n) ~ annual_mean_temp + mean_diurnal_range + pp_seasonality + pp_warmest_quarter, data = trees, family=binomial)
## only location, no variability
mod_h1_2 = glm(cbind(died, n) ~ annual_mean_temp + pp_warmest_quarter, data = trees, family=binomial)
## variability only
mod_h1_3 = glm(cbind(died, n) ~ mean_diurnal_range + pp_seasonality, data = trees, family=binomial)
## temperature only
mod_h1_4 = glm(cbind(died, n) ~ annual_mean_temp + mean_diurnal_range, data = trees, family=binomial)
## With all 2-way interactions
mod_h1_int = glm(cbind(died, n) ~ (annual_mean_temp + mean_diurnal_range + pp_seasonality + pp_warmest_quarter)^2, data = trees, family=binomial)


mods_h1 = data.frame(model = c("mod_h1_1", "mod_h1_2", "mod_h1_3", "mod_h1_4", "mod_h1_int"), 
                     aic = c(AIC(mod_h1_1), AIC(mod_h1_2), AIC(mod_h1_3), 
                            AIC(mod_h1_4), AIC(mod_h1_int)))
mods_h1$dAIC = mods_h1$aic - min(mods_h1$aic)
mods_h1
##        model      aic      dAIC
## 1   mod_h1_1 15645.74  80.26063
## 2   mod_h1_2 15730.74 165.26185
## 3   mod_h1_3 15665.49 100.01677
## 4   mod_h1_4 15751.62 186.14545
## 5 mod_h1_int 15565.48   0.00000
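The ΔAIC values can also be converted to Akaike weights, which express the relative support for each model on a 0–1 scale. A sketch, extending the `mods_h1` table above:

```r
## Akaike weight: relative likelihood of each model, normalized to sum to 1
mods_h1$w = exp(-0.5 * mods_h1$dAIC) / sum(exp(-0.5 * mods_h1$dAIC))
mods_h1
```

With ΔAIC values as large as those above (≥ 80 for all non-interaction models), essentially all of the weight falls on the interaction model.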
coef(mod_h1_1)
##        (Intercept)   annual_mean_temp mean_diurnal_range     pp_seasonality 
##       -2.577322327        0.011936138        0.045388491        0.011135541 
## pp_warmest_quarter 
##       -0.001051458
## refit with standardized predictors so that coefficient magnitudes are directly comparable
new_trees = trees
new_trees[, c("annual_mean_temp", "mean_diurnal_range", "pp_seasonality", "pp_warmest_quarter")] = 
    scale(new_trees[, c("annual_mean_temp", "mean_diurnal_range", "pp_seasonality", "pp_warmest_quarter")])
mod_h1_1_sc = glm(cbind(died, n) ~ annual_mean_temp + mean_diurnal_range + pp_seasonality + pp_warmest_quarter, data = new_trees, family=binomial)

By AIC, the interaction model is clearly best (ΔAIC = 0); among the main-effects models, the first, with all 4 variables, is preferred. Since six interaction terms are hard to interpret, we examine the coefficients of the main-effects model: more variability in climate (positive coefficients for mean_diurnal_range and pp_seasonality) results in increased mortality. We can make response curves by generating some artificial data, setting the other variables to their means when computing the predictions.

x_plot = data.frame(annual_mean_temp = seq(min(trees$annual_mean_temp), max(trees$annual_mean_temp), length.out = 500), 
                    mean_diurnal_range = mean(trees$mean_diurnal_range), pp_seasonality = mean(trees$pp_seasonality), 
                    pp_warmest_quarter = mean(trees$pp_warmest_quarter))
# use the predict function to find out what we expect for mortality if x takes hypothetical values
# type = 'response' takes care of transforming away the link function, otherwise we would get predictions on the logit scale
y_plot = predict(mod_h1_1, type='response', newdata = x_plot)

par(mfrow = c(2,2), bty='n')
plot(x_plot$annual_mean_temp, y_plot, type='l', xlab='Annual Mean Temperature', ylab = "Mortality Probability", ylim=c(0,1), col='blue', lwd=2)
points(trees$annual_mean_temp, trees$died/trees$n, pch=16, cex=0.4)

## now just edit the appropriate columns to make the rest of the plots
for(i in 2:4) {
    nm = colnames(x_plot)[i]
    nm_pr = colnames(x_plot)[i-1]
    x_plot[,nm_pr] = mean(trees[,nm_pr]) # set the previous column to the mean
    # set the next to a sequence
    x_plot[, nm] = seq(min(trees[,nm]), max(trees[,nm]), length.out = 500)
    y_plot = predict(mod_h1_1, type='response', newdata = x_plot)
    plot(x_plot[,nm], y_plot, type='l', xlab=nm, ylab = "Mortality Probability", ylim=c(0,1), col='blue', lwd=2)
    points(trees[,nm], trees$died/trees$n, pch=16, cex=0.4)

}

Hypothesis 2

Here we check for hump-shaped relationships between the number of trees and climate. The number of trees is a count, so we use a Poisson model. I choose to include quadratic terms only for climate location; it makes sense that variability is either good or bad, not necessarily that there is an intermediate optimum.

## biggest model, including all variables and a squared term for location
mod_h2_1 = glm(n  ~ annual_mean_temp + mean_diurnal_range + pp_seasonality + pp_warmest_quarter + I(annual_mean_temp^2) + I(pp_warmest_quarter^2), data = trees, family=poisson)
## only location, no variability
mod_h2_2 = glm(n ~ annual_mean_temp + pp_warmest_quarter + I(annual_mean_temp^2) + I(pp_warmest_quarter^2), data = trees, family=poisson)
## location, linear only
mod_h2_3 = glm(n ~ annual_mean_temp + pp_warmest_quarter, data = trees, family=poisson)
## temperature only
mod_h2_4 = glm(n  ~ annual_mean_temp + mean_diurnal_range + I(annual_mean_temp^2), data = trees, family=poisson)

mods_h2 = data.frame(model = c("mod_h2_1", "mod_h2_2", "mod_h2_3", "mod_h2_4"), 
                     aic = c(AIC(mod_h2_1), AIC(mod_h2_2), AIC(mod_h2_3), AIC(mod_h2_4)))
mods_h2$dAIC = mods_h2$aic - min(mods_h2$aic)
mods_h2
##      model      aic       dAIC
## 1 mod_h2_1 42017.68    0.00000
## 2 mod_h2_2 42066.10   48.42201
## 3 mod_h2_3 42386.89  369.20384
## 4 mod_h2_4 43720.70 1703.02185
coef(mod_h2_1)
##             (Intercept)        annual_mean_temp      mean_diurnal_range 
##            7.055290e+00           -1.087067e-01            5.738812e-02 
##          pp_seasonality      pp_warmest_quarter   I(annual_mean_temp^2) 
##            1.320556e-03           -2.999770e-02            1.426522e-03 
## I(pp_warmest_quarter^2) 
##            4.416641e-05

The first model, with all 4 variables and both quadratic terms, is by far the best (ΔAIC ≥ 48 for all others). With quadratic terms, interpreting the coefficients directly is difficult; we need to check the response curves.
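One thing we can read off directly: for a quadratic term β1·x + β2·x², the stationary point of the response lies at x = -β1/(2·β2), and it is an optimum (hump) only if β2 < 0. A sketch for temperature, using the coefficients above:

```r
b = coef(mod_h2_1)
## stationary point of the temperature response (on the predictor scale)
-b["annual_mean_temp"] / (2 * b["I(annual_mean_temp^2)"])
## sign of the quadratic coefficient: negative = hump (optimum), positive = trough
b["I(annual_mean_temp^2)"]
```

Here both quadratic coefficients are positive, so the stationary points are minima rather than optima.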

x_plot = data.frame(annual_mean_temp = seq(min(trees$annual_mean_temp), max(trees$annual_mean_temp), length.out = 500), 
                    mean_diurnal_range = mean(trees$mean_diurnal_range), pp_seasonality = mean(trees$pp_seasonality), 
                    pp_warmest_quarter = mean(trees$pp_warmest_quarter))
# use the predict function to find the expected abundance for hypothetical values of x
# type = 'response' takes care of inverting the link function, otherwise we would get predictions on the log scale
y_plot = predict(mod_h2_1, type='response', newdata = x_plot)

par(mfrow = c(2,2), bty='n')
plot(x_plot$annual_mean_temp, y_plot, type='l', xlab='Annual Mean Temperature', ylab = "Expected Number of Trees")

## now just edit the appropriate columns to make the rest of the plots
for(i in 2:4) {
    nm = colnames(x_plot)[i]
    nm_pr = colnames(x_plot)[i-1]
    x_plot[,nm_pr] = mean(trees[,nm_pr]) # set the previous column to the mean
    # set the next to a sequence
    x_plot[, nm] = seq(min(trees[,nm]), max(trees[,nm]), length.out = 500)
    y_plot = predict(mod_h2_1, type='response', newdata = x_plot)
    plot(x_plot[,nm], y_plot, type='l', xlab=nm, ylab = "Expected Number of Trees")
}

The quadratic terms produce some rather unexpected predictions. In particular, not only do we find no evidence of an optimum, there is an anti-optimum with regard to precipitation: the model predicts the fewest trees at intermediate precipitation. Likely we have not done an adequate job building this model, so we will need to explore further hypotheses…

But for the moment, we have another problem. Poisson models assume the variance is equal to the mean. We can check this with the "dispersion parameter", which should be equal to one. Compute it as the ratio of residual deviance to residual degrees of freedom. For comparison, we can do this for all models.

mod_h2_1$deviance / mod_h2_1$df.residual
## [1] 4.675947
mod_h2_2$deviance / mod_h2_2$df.residual
## [1] 4.684589
mod_h2_3$deviance / mod_h2_3$df.residual
## [1] 4.747849
mod_h2_4$deviance / mod_h2_4$df.residual
## [1] 5.015966

As the models get simpler, the dispersion parameter gets worse (simpler models explain less deviance). In all cases the dispersion parameters are well above one, so in a complete analysis we would need to consider more variables, or another model structure. Unfortunately, the counts are much too close to zero to fall back on a normal approximation.
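Two standard remedies for overdispersion are a quasi-Poisson family (same coefficients, standard errors inflated by the dispersion parameter) or a negative binomial model, which estimates an extra variance parameter. A sketch, not run here:

```r
## quasi-Poisson refit: corrects the standard errors for overdispersion
mod_h2_qp = glm(n ~ annual_mean_temp + mean_diurnal_range + pp_seasonality +
                  pp_warmest_quarter + I(annual_mean_temp^2) + I(pp_warmest_quarter^2),
                data = trees, family = quasipoisson)

## negative binomial alternative, via MASS::glm.nb
library(MASS)
mod_h2_nb = glm.nb(n ~ annual_mean_temp + mean_diurnal_range + pp_seasonality +
                     pp_warmest_quarter + I(annual_mean_temp^2) + I(pp_warmest_quarter^2),
                   data = trees)
```

Note that quasi-Poisson models have no true likelihood, so AIC-based model comparison is not available for them; the negative binomial model does allow AIC comparison.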

hist(trees$n)